Skip to content

feat(manifest): add ManifestFilterManager and ManifestMergeManager#652

Open
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:manifest-manager
Open

feat(manifest): add ManifestFilterManager and ManifestMergeManager#652
gty404 wants to merge 1 commit into
apache:mainfrom
gty404:manifest-manager

Conversation

@gty404
Copy link
Copy Markdown
Contributor

@gty404 gty404 commented May 12, 2026

Implement two manifest management classes for table write operations:

  • ManifestFilterManager: filters manifest entries by row filter expression, file path, or partition value; supports FailMissingDeletePaths validation. Rewrites manifests that contain matching files, marking entries as DELETED; passes through manifests that cannot contain matching files unchanged.

  • ManifestMergeManager: merges small manifests using greedy bin-packing, grouping by partition_spec_id (manifests with different specs are never merged). Oversized manifests pass through unchanged. ADDED entries from prior manifests become EXISTING when merged (matching Java semantics).

Implement two manifest management classes for table write operations:

- ManifestFilterManager: filters manifest entries by row filter expression,
  file path, or partition value; supports FailMissingDeletePaths validation.
  Rewrites manifests that contain matching files, marking entries as DELETED;
  passes through manifests that cannot contain matching files unchanged.

- ManifestMergeManager: merges small manifests using greedy bin-packing,
  grouping by partition_spec_id (manifests with different specs are never
  merged). Oversized manifests pass through unchanged. ADDED entries from
  prior manifests become EXISTING when merged (matching Java semantics).
@gty404 gty404 force-pushed the manifest-manager branch from 4a08049 to f360476 Compare May 12, 2026 09:53
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java parity review: the overall structure looks reasonable, but a few details are not yet aligned with Java ManifestFilterManager/ManifestMergeManager semantics.

for (const auto& de : delete_exprs_) {
ICEBERG_ASSIGN_OR_RAISE(auto* eval, GetMetricsEvaluator(metadata, spec_id, de));
ICEBERG_ASSIGN_OR_RAISE(auto matches, eval->Evaluate(file));
if (matches) return true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This row-filter delete path is broader than Java's behavior. Java's ManifestFilterManager first computes a partition residual and then uses both inclusive and strict metrics: files are deleted only when rowsMustMatch is true, and data files where only some rows may match raise Cannot delete file where some, but not all, rows match filter. Here, an inclusive might match result is enough to mark the whole file as DELETED, so a file whose bounds straddle the predicate would be incorrectly removed instead of rejected.

const TableMetadata& metadata, const std::shared_ptr<Snapshot>& base_snapshot,
const ManifestWriterFactory& writer_factory) {
// No base snapshot → nothing to filter
if (!base_snapshot) return std::vector<ManifestFile>{};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java still validates required deletes when the manifest list is null or empty: filterManifests calls validateRequiredDeletes() before returning. With FailMissingDeletePaths() set, this early return silently succeeds for a missing path on an empty/no-base snapshot, while Java reports Missing required files to delete.

bool ManifestFilterManager::CanContainDeletedFiles(const ManifestFile& manifest,
const TableMetadata& metadata) {
// A manifest with no live files cannot contain files to delete.
bool has_live = (manifest.added_files_count.value_or(0) > 0) ||
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This treats missing manifest counts as zero live files. In Iceberg/Java, null addedFilesCount or existingFilesCount means the count is unknown and ManifestFile.hasAddedFiles/hasExistingFiles returns true. Skipping such manifests can miss deletes for older or compatible metadata where counts are not populated.

all.insert(all.end(), new_manifests.begin(), new_manifests.end());
all.insert(all.end(), existing_manifests.begin(), existing_manifests.end());

if (!merge_enabled_ || std::cmp_less(all.size(), min_count_to_merge_)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This global min_count_to_merge_ check does not match Java. Java only applies minCountToMerge to the bin containing the first/newest manifest; older bins that do not contain first can still be merged even when the total input count is below the threshold. Returning all manifests here can skip merges Java would perform.

std::vector<ManifestFile> result;

for (const auto& manifest : group) {
if (manifest.manifest_length >= target_size_bytes_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java uses ListPacker.packEnd(group, ManifestFile::length) with lookback 1 specifically to preserve manifest order and leave the under-filled bin at the beginning. This forward hand-rolled packing keeps the current small bin open across an oversized manifest, so a sequence like [small, oversized, small] can merge the two small manifests across the oversized one and reorder output relative to Java.

const ManifestFile& first = all[0];

// Group manifests by partition_spec_id — never merge across specs
std::map<int32_t, std::vector<ManifestFile>> by_spec;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java groups manifests by partition spec using a reverse-order TreeMap. This std::map iterates specs in ascending order, so the merged manifest list order can differ from Java. For v3 tables this ordering is observable because manifest-list writing assigns first-row IDs in output order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants